Title: Parkinson's Disease Data Set
Abstract: Oxford Parkinson's Disease Detection Dataset
Source:
The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.
Data Description & Context:
Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson's Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn't require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.
Attribute Information:
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
RPDE, D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
Objective:
The goal is to classify patients into their respective labels (healthy vs. Parkinson's) using the attributes derived from their voice recordings.
Steps and tasks:
#Numerical Calculations
import numpy as np
import pandas as pd
from scipy.stats import norm, shapiro
#Data Visualization
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn import tree
from sklearn.tree import export_graphviz
from IPython.display import Image
#Data Preprocessing
from sklearn.preprocessing import RobustScaler
#Train and Test preparation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
#Model Building - Classifiers
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
#Model Building - Ensembles
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier, plot_importance
#Model Evaluation
from sklearn.metrics import confusion_matrix, classification_report, roc_curve
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score, log_loss
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve #Deprecated since scikit-learn 1.0 in favour of the *Display classes
from sklearn.inspection import plot_partial_dependence #Deprecated since scikit-learn 1.0 (use PartialDependenceDisplay)
#Misc
import os
import gc
import warnings
warnings.filterwarnings('ignore')
Loading the dataset and checking the first 10 records
parkinsons = pd.read_csv("/content/drive/My Drive/Colab Notebooks/Datasets/Data - Parkinsons")
parkinsons.head(10)
Checking the Features / Variables
parkinsons.columns
Checking the shape of the dataset
print('No of Rows : ', parkinsons.shape[0])
print('No of Columns : ', parkinsons.shape[1])
Checking the datatypes and null records
parkinsons.info()
Checking total number of Null values in the dataset
parkinsons.isnull().sum().sum()
Observations:
#Checking if the dataset has only Numeric Data
parkinsons.applymap(np.isreal).all()
Re-ordering the target variable 'status' as the last column for our convenience
temp = parkinsons['status']
parkinsons.drop(['status'], axis = 1, inplace=True)
parkinsons['status'] = temp
parkinsons.head()
parkinsons.describe().T
Observations from the Statistical Summary:
The scales of the variables differ widely, so it is imperative to apply a scaler transformation before model building.
MDVP:Fo(Hz) :
MDVP:Fhi(Hz) :
MDVP:Flo(Hz) :
MDVP:Jitter(%) :
MDVP:Jitter(Abs) :
MDVP:RAP :
MDVP:PPQ :
Jitter:DDP :
MDVP:Shimmer :
MDVP:Shimmer(dB) :
Shimmer:APQ3 :
Shimmer:APQ5 :
MDVP:APQ :
Shimmer:DDA :
NHR :
HNR :
RPDE :
DFA :
spread1 :
spread2 :
D2 :
PPE :
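To illustrate the scaling observation above, here is a minimal sketch on synthetic data (the column names `freq_like` and `jitter_like` are hypothetical stand-ins for features such as MDVP:Fo(Hz) and MDVP:Jitter(Abs)): RobustScaler, used later in this notebook, centres each feature on its median and scales by its IQR, bringing wildly different scales into line.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Two synthetic features on very different scales, mimicking e.g.
# a frequency in Hz (~100-250) vs an absolute jitter (~1e-5).
rng = np.random.default_rng(24)
toy = pd.DataFrame({
    'freq_like': rng.normal(150, 40, 200),
    'jitter_like': rng.normal(4e-5, 1e-5, 200),
})
print(toy.describe().loc[['min', 'max']])  # ranges differ by several orders of magnitude

scaled = pd.DataFrame(RobustScaler().fit_transform(toy), columns=toy.columns)
# After scaling, both features are centred on their median with unit IQR.
print(scaled.median().round(2))
print((scaled.quantile(0.75) - scaled.quantile(0.25)).round(2))
```
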
Dropping name from the dataframe.
parkinsons = parkinsons.drop(columns='name')
parkinsons.head()
#Segregating the Numeric Predictors for plotting
num_cols = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA',
'spread1', 'spread2', 'D2', 'PPE']
def plot_univariate_features(df):
    """
    Helper function to plot univariate features.
    Input : Dataframe; Output : All univariate plots for both numeric and categorical variables.
    """
    print("Integer Columns = ", df.select_dtypes(include=['int32','int64']).columns)
    print("Floating Point Columns = ", df.select_dtypes(include=['float64']).columns)
    print("Object Columns = ", df.select_dtypes(include=['object']).columns)
    print("Category Columns = ", df.select_dtypes(include=['category']).columns)
    #sns.set_style(style='darkgrid')
    int_cols = pd.Series(df.select_dtypes(include=['int32','int64']).columns)
    for j in range(len(int_cols)):
        f, axes = plt.subplots(1, 2, figsize=(10, 10))
        sns.boxplot(df[int_cols[j]], ax=axes[0], palette='Greens_r')
        sns.distplot(df[int_cols[j]], ax=axes[1], fit=norm)
        plt.tight_layout()
    float_cols = pd.Series(df.select_dtypes(include=['float64']).columns)
    for j in range(len(float_cols)):
        f, axes = plt.subplots(1, 2, figsize=(10, 10))
        sns.boxplot(df[float_cols[j]], ax=axes[0], palette='Greens_r')
        sns.distplot(df[float_cols[j]], ax=axes[1], fit=norm)
        plt.tight_layout()
    obj_cols = pd.Series(df.select_dtypes(include=['object']).columns)
    for j in range(len(obj_cols)):
        plt.subplots()
        sns.countplot(df[obj_cols[j]])
plot_univariate_features(parkinsons[num_cols])
#f = pd.melt(parkinsons, value_vars=cols)
#g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=4, aspect=2)
#g = g.map(sns.distplot, "value")
List of features with Multiple Gaussian Mixtures (Multimodal Distributions):
MDVP:Fhi(Hz), MDVP:RAP, MDVP:PPQ, Shimmer:APQ5, HNR, RPDE, DFA
normality_test = lambda x: shapiro(x.fillna(0))[1] < 0.01
normal = parkinsons[num_cols]
normal = normal.apply(normality_test)
print(~normal) #True = column looks approximately normal (Shapiro p >= 0.01)
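As a sanity check of the Shapiro-Wilk rule used above (p < 0.01 means "reject normality"), a toy example on synthetic samples:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(24)
gaussian = rng.normal(0, 1, 500)     # drawn from a normal distribution
skewed = rng.exponential(1.0, 500)   # heavily right-skewed

# Same rule as the notebook's lambda: True means "reject normality at alpha = 0.01".
non_normal = lambda x: shapiro(x)[1] < 0.01
print(non_normal(gaussian), non_normal(skewed))
```
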
non_normal_cols = ['MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)','MDVP:Jitter(%)',
'MDVP:Jitter(Abs)','MDVP:RAP','MDVP:PPQ','Jitter:DDP','MDVP:Shimmer',
'MDVP:Shimmer(dB)','Shimmer:APQ3','Shimmer:APQ5','MDVP:APQ','Shimmer:DDA','NHR','HNR','RPDE','PPE']
non_normal_cols
plt.figure(figsize=(10,10))
parkinsons[num_cols].skew().sort_values().plot(kind='barh')
plt.show()
def outliers_IQR(df):
    """
    Helper function to detect outliers in the dataframe.
    Input : Dataframe; Output : Dataframe containing percentage and number of outliers.
    """
    outliers_list = []
    no_of_outliers = []
    for c in df.columns:
        Q1 = df[c].quantile(0.25)
        Q3 = df[c].quantile(0.75)
        IQR = Q3 - Q1
        df_outliers = np.where((df[c] < (Q1 - 1.5 * IQR)) | (df[c] > (Q3 + 1.5 * IQR)))
        no_of_outliers.append(len(df_outliers[0]))
        outliers_list.append(round((len(df_outliers[0]) / len(df[c]) * 100), 2))
    outliers_df = pd.DataFrame({"Percentage_of_Outliers": outliers_list, "No_of_Outliers": no_of_outliers}, index=df.columns)
    return outliers_df.sort_values(by="Percentage_of_Outliers", ascending=False)
outliers_IQR(parkinsons[non_normal_cols])
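The 1.5×IQR fences computed inside `outliers_IQR` can be verified by hand on a tiny hypothetical series:

```python
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is an obvious outlier
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)               # 12.0 and 13.75
IQR = Q3 - Q1                                             # 1.75
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())  # → 9.375 16.375 [102]
```
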
Imputing the Outliers
# Flagging outliers (data points beyond 1.5*IQR) as NaN
for col in non_normal_cols:
    Q1 = parkinsons[col].quantile(0.25)
    Q3 = parkinsons[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_lim = Q1 - 1.5*IQR
    upper_lim = Q3 + 1.5*IQR
    parkinsons.loc[(parkinsons[col] > upper_lim), col] = np.nan
    parkinsons.loc[(parkinsons[col] < lower_lim), col] = np.nan
print('Outliers are set to NaN')
print('Number of outlier values set to NaN : ', parkinsons.isnull().sum().sum())
#Imputing the Outlier NaNs with Median values
parkinsons[non_normal_cols] = parkinsons[non_normal_cols].apply(lambda x: x.fillna(x.median()),axis=0)
parkinsons.sample(3)
parkinsons.isnull().any().sum() #Sanity check post imputation
status
parkinsons['status'].value_counts()
sns.countplot(y='status', data=parkinsons);
colors = ["lightgreen", "silver"]
labels = ['Yes', 'No']
plt.pie(parkinsons['status'].value_counts(), labels = labels, colors = colors, autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Parkinsons patients in the dataset")
plt.show()
Around three-quarters of the records belong to subjects diagnosed with Parkinson's. This is mainly because each subject contributes roughly six recordings to the dataset, which induces 'Sample Bias'.
However, since the objective of this dataset is to identify Parkinson's in its early stages, we care more about the True Positive Rate than about the negative records (misclassifying a healthy subject is far less costly than misclassifying a Parkinson's patient as healthy).
Hence, we will not do any resampling for the Class Imbalance and will be looking at additional metrics such as ROC AUC, F1-Score, Recall etc in addition to the Accuracy.
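A small, hypothetical illustration of why accuracy alone is not enough here: a model with three false negatives still posts high accuracy, but the Recall drop exposes the missed Parkinson's cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical labels mirroring the ~75/25 split: 1 = Parkinson's, 0 = healthy.
y_true = np.array([1]*15 + [0]*5)
# A model that misses 3 of the 15 Parkinson's patients (3 false negatives).
y_pred = np.array([1]*12 + [0]*3 + [0]*5)

print('Accuracy :', accuracy_score(y_true, y_pred))   # 17/20 = 0.85
print('Recall   :', recall_score(y_true, y_pred))     # 12/15 = 0.80
print('Precision:', precision_score(y_true, y_pred))  # 12/12 = 1.00
```
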
def plot_predictor_target_features(df):
    """
    Helper function to plot multivariate features.
    Input : Dataframe; Output : One-to-Many multivariate plots for predictors vs target feature.
    """
    print("All Columns = ", df.columns)
    predictor_cols = pd.Series(df.columns)
    for j in range(len(predictor_cols)):
        f, axes = plt.subplots(1, 2, figsize=(10, 10))
        sns.boxplot(y=df[predictor_cols[j]], x=parkinsons['status'], ax=axes[0], palette='Accent')
        sns.swarmplot(y=df[predictor_cols[j]], x=parkinsons['status'], ax=axes[0], palette='Dark2')
        sns.violinplot(y=df[predictor_cols[j]], x=parkinsons['status'], ax=axes[1], palette='YlGn')
        plt.tight_layout()
plot_predictor_target_features(parkinsons[num_cols])
plt.figure(figsize=(20,20))
sns.pairplot(parkinsons, hue='status', diag_kind = 'kde')
plt.show()
plt.figure(figsize=(14,14), dpi=300)
sns.heatmap(parkinsons.corr(), annot=True, linewidths = 0.3, fmt = '0.2f', cmap = 'YlGnBu', square=True)
plt.show()
status
plt.figure(figsize=(15,15))
parkinsons.corr()[['status']].sort_values(by='status', ascending=True).plot(kind='barh')
plt.show()
#Identifying highly correlated Independent variables (Predictors)
high_corr_var = parkinsons.corr().abs().unstack() #abs() considers absolute values of correlation coefficients.
high_corr_var = high_corr_var.sort_values(kind = "quicksort", ascending = False)
print('No of highly correlated instances : ', len(high_corr_var[(high_corr_var > 0.8) & (high_corr_var < 1)]), '(each pair is counted twice)')
print("\n")
print("Percentage of highly correlated predictor pairs : ", (len(high_corr_var[(high_corr_var > 0.8) & (high_corr_var < 1)]) / len(high_corr_var) * 100))
print("\n")
print(high_corr_var[(high_corr_var > 0.8) & (high_corr_var < 1)])
MDVP:Fhi(Hz), DFA, MDVP:Fo(Hz), MDVP:Flo(Hz), RPDE, spread2 and D2 have weak correlation with the other predictors. PPE, spread1, MDVP:APQ, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ5, MDVP:PPQ, Jitter:DDP, MDVP:RAP, MDVP:Jitter(Abs), Shimmer:APQ3, MDVP:Jitter(%), Shimmer:DDA and NHR exhibit multicollinearity, which will be checked further with the Variance Inflation Factor (VIF) in the next section. HNR exhibits negative correlation with the target variable status.
def calculate_vif(X):
    vif_features = pd.DataFrame()
    vif_features["Features"] = X.columns
    vif_features["VIF Score"] = [vif(X.values, i) for i in range(X.shape[1])]
    return vif_features
vif_df = calculate_vif(parkinsons[num_cols])
vif_df.sort_values(by='VIF Score', ascending=False)
vif_df[vif_df['VIF Score']<10].sort_values(by='VIF Score', ascending=False)
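How the VIF behaves can be sketched on synthetic data: a nearly duplicated predictor receives a very large VIF, while an independent one stays near 1 (the column names below are hypothetical).

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

rng = np.random.default_rng(24)
a = rng.normal(size=300)
b = rng.normal(size=300)
X_toy = pd.DataFrame({
    'a': a,
    'b': b,
    'almost_a': a + rng.normal(scale=0.05, size=300),  # nearly collinear with 'a'
})
scores = pd.Series([vif(X_toy.values, i) for i in range(X_toy.shape[1])],
                   index=X_toy.columns)
print(scores.round(1))  # 'a' and 'almost_a' get huge VIFs; 'b' stays near 1
```
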
X = parkinsons[num_cols]
y = parkinsons['status']
#Stratify is used to preserve proportion of target classes in splits, since the Class is Imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state = 24)
display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
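A quick, self-contained check of what `stratify=y` guarantees, on hypothetical labels with the same rough 75/25 split as `status`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([1]*150 + [0]*50)           # 75% positives, like 'status'
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, stratify=y_toy,
                                      test_size=0.3, random_state=24)
# Both splits preserve the 75% positive rate.
print(ytr.mean(), yte.mean())
```
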
#Outliers in our dataset will be better handled by RobustScaler()
scaler = RobustScaler()
#Fitting the Training set
scaler.fit(X_train)
Note: Fit using X_train and transform both X_train and X_test. It is better to construct Pipelines for such Preprocessing steps, though it is out of scope of this current assignment.
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
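As the note above suggests, a Pipeline keeps the scaler fitted only on training data inside each CV fold. A minimal sketch with synthetic stand-in data (not this notebook's X and y):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Synthetic stand-in data: label driven by the first feature plus noise.
rng = np.random.default_rng(24)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ('scale', RobustScaler()),           # fit on each training fold only
    ('clf', SVC(kernel='rbf', gamma='auto')),
])
scores = cross_val_score(pipe, X_demo, y_demo, cv=StratifiedKFold(n_splits=5))
print(scores.mean().round(3))
```
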
Note:
def classifier_metrics(classifier):
    """
    Helper function to evaluate and plot classifier performance.
    Input : Classifier;
    Output : Accuracy, Classification Report, Confusion Matrix, ROC AUC Curve, Precision-Recall Curve
    """
    y_true, y_pred = y_test, classifier.predict(X_test)
    train_accuracy = classifier.score(X_train, y_train).round(3)
    test_accuracy = classifier.score(X_test, y_test).round(3)
    print("Train Accuracy : ", train_accuracy)
    print("Test Accuracy : ", test_accuracy)
    # Classification Report
    print("\nClassification Report")
    print('\n{}'.format(classification_report(y_true, y_pred)))
    # Precision Score
    clf_precision = precision_score(y_true, y_pred)
    print('\nPrecision Score:\n', clf_precision.round(3))
    # Recall Score
    clf_recall = recall_score(y_true, y_pred)
    print('\nRecall Score:\n', clf_recall.round(3))
    # F1 Score
    clf_f1 = f1_score(y_true, y_pred)
    print('\nF1 Score:\n', clf_f1.round(3))
    # ROC AUC Score (computed from hard predictions here; the curve below uses probabilities)
    clf_roc_auc = roc_auc_score(y_true, y_pred)
    print('\nROC AUC Score:\n', clf_roc_auc.round(3))
    # Gini Coefficient
    clf_gini_coeff = 2 * clf_roc_auc - 1
    print("\nGini Coefficient :\n", clf_gini_coeff.round(2))
    # Confusion Matrix
    clf_conf_matrix = confusion_matrix(y_true, y_pred)
    print('\nConfusion Matrix:\n')
    sns.heatmap(clf_conf_matrix, annot=True, square=True, fmt='g')
    # ROC Curve
    clf_fpr, clf_tpr, clf_thresholds = roc_curve(y_true, classifier.predict_proba(X_test)[:, 1])
    plt.figure(figsize=(14, 6))
    plt.plot(clf_fpr, clf_tpr, label='Classifier AUC (area = {})'.format(clf_roc_auc.round(2)))
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.show()
    # Precision-Recall Curve
    plt.figure(figsize=(14, 6))
    plot_precision_recall_curve(classifier, X_test, y_test)
    plt.title("Precision-Recall AUC Curve")
    plt.legend(loc='lower right')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.show()
    return train_accuracy, test_accuracy, clf_conf_matrix, clf_precision, clf_recall, clf_f1, clf_roc_auc, clf_gini_coeff, clf_fpr, clf_tpr, clf_thresholds
SVM with Polynomial Kernel
svm_clf = SVC(kernel = 'poly', C = 100, gamma='auto', probability=True)
svm_clf.fit(X_train, y_train)
svm_train_accuracy, svm_test_accuracy, svm_clf_conf_matrix, svm_clf_precision, svm_clf_recall, svm_clf_f1, svm_clf_roc_auc, svm_clf_gini_coeff, svm_clf_fpr, svm_clf_tpr, svm_clf_thresholds = classifier_metrics(svm_clf)
SVM with Radial Basis Kernel
svm_clf_rbf = SVC(kernel = 'rbf', C = 100, gamma='auto', probability=True)
svm_clf_rbf.fit(X_train, y_train)
svm_train_accuracy, svm_test_accuracy, svm_clf_conf_matrix, svm_clf_precision, svm_clf_recall, svm_clf_f1, svm_clf_roc_auc, svm_clf_gini_coeff, svm_clf_fpr, svm_clf_tpr, svm_clf_thresholds = classifier_metrics(svm_clf_rbf)
log_reg_clf = LogisticRegression(solver='lbfgs', n_jobs=-1)
log_reg_clf.fit(X_train, y_train)
lr_train_accuracy, lr_test_accuracy, lr_clf_conf_matrix, lr_clf_precision, lr_clf_recall, lr_clf_f1, lr_clf_roc_auc, lr_clf_gini_coeff, lr_clf_fpr, lr_clf_tpr, lr_clf_thresholds = classifier_metrics(log_reg_clf)
Logistic Regression, being a parametric method, has close Train and Test Accuracy scores, showing that the model has not overfitted.
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)
gnb_train_accuracy, gnb_test_accuracy, gnb_clf_conf_matrix, gnb_clf_precision, gnb_clf_recall, gnb_clf_f1, gnb_clf_roc_auc, gnb_clf_gini_coeff, gnb_clf_fpr, gnb_clf_tpr, gnb_clf_thresholds = classifier_metrics(gnb_clf)
The given dataset has many multicollinear features, which violates the core 'naive' conditional-independence assumption of the Gaussian Naive Bayes algorithm. This is visible in the mediocre results displayed by the model.
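The effect of violating the independence assumption can be demonstrated directly: duplicating a single feature makes Gaussian Naive Bayes count the same evidence twice and produce more extreme probabilities (toy, one-dimensional data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(24)
x = np.r_[rng.normal(-1, 1, 200), rng.normal(1, 1, 200)].reshape(-1, 1)
y_lab = np.r_[np.zeros(200), np.ones(200)]

gnb_single = GaussianNB().fit(x, y_lab)
# A perfectly collinear copy of the feature: the same evidence counted twice.
x_dup = np.hstack([x, x])
gnb_dup = GaussianNB().fit(x_dup, y_lab)

p1 = gnb_single.predict_proba([[0.5]])[0, 1]
p2 = gnb_dup.predict_proba([[0.5, 0.5]])[0, 1]
print(round(p1, 3), round(p2, 3))  # the duplicated-feature model is more extreme
```
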
dt_clf = DecisionTreeClassifier(criterion='entropy', random_state=24)
dt_clf.fit(X_train, y_train)
dt_train_accuracy, dt_test_accuracy, dt_clf_conf_matrix, dt_clf_precision, dt_clf_recall, dt_clf_f1, dt_clf_roc_auc, dt_clf_gini_coeff, dt_clf_fpr, dt_clf_tpr, dt_clf_thresholds = classifier_metrics(dt_clf)
text_representation = tree.export_text(dt_clf)
print(text_representation)
import graphviz
dot_data = tree.export_graphviz(dt_clf, out_file=None,
feature_names=X.columns,
class_names=['No', 'Yes'],
filled=True)
graph = graphviz.Source(dot_data, format="png")
graph
! pip install dtreeviz
from dtreeviz.trees import dtreeviz # remember to load the package
viz = dtreeviz(dt_clf, X, y,
target_name="Status",
feature_names=X.columns,
class_names=['No', 'Yes'])
viz
stk_clf = StackingClassifier(estimators=[('SVC', svm_clf_rbf), ('LogR', log_reg_clf), ('GNB', gnb_clf), ('DT', dt_clf)],
final_estimator=dt_clf)
#DecisionTreeClassifier() is used as the final_estimator for this StackingClassifier.
stk_clf.fit(X_train, y_train)
stk_train_accuracy, stk_test_accuracy, stk_clf_conf_matrix, stk_clf_precision, stk_clf_recall, stk_clf_f1, stk_clf_roc_auc, stk_clf_gini_coeff, stk_clf_fpr, stk_clf_tpr, stk_clf_thresholds = classifier_metrics(stk_clf)
Let us start with a Bagging Ensemble - Random Forest!
rf_clf = RandomForestClassifier(n_estimators=100, criterion='entropy', n_jobs=-1, oob_score=True, random_state=24)
rf_clf.fit(X_train, y_train)
print("Out-of-bag Evaluation Score for Random Forest : ", rf_clf.oob_score_.round(3))
rf_train_accuracy, rf_test_accuracy, rf_clf_conf_matrix, rf_clf_precision, rf_clf_recall, rf_clf_f1, rf_clf_roc_auc, rf_clf_gini_coeff, rf_clf_fpr, rf_clf_tpr, rf_clf_thresholds = classifier_metrics(rf_clf)
def plot_feature_rank(clf):
    feature_rank = pd.DataFrame({
        'feature': X_train.columns,
        'importance': clf.feature_importances_.round(3)
    })
    feature_rank = feature_rank.sort_values('importance', ascending=False)
    print("Amount of variance explained by the predictors\n")
    feature_rank['cumulative_sum'] = feature_rank['importance'].cumsum() * 100
    plt.figure(figsize=(8, 6))
    sns.barplot(y='feature', x='importance', data=feature_rank)
    plt.show()
    return feature_rank
rf_feature_rank = plot_feature_rank(rf_clf)
rf_feature_rank
sns.lineplot(y=rf_feature_rank['cumulative_sum'], x=np.arange(0,22,1))
The PPE feature is the most important feature according to both the RandomForestClassifier and DecisionTreeClassifier algorithms. HNR, which has a negative correlation with the target variable, is ranked as the 2nd most important feature by the DecisionTreeClassifier; however, it is given only moderate priority by the RandomForestClassifier.
rf_clf_1 = RandomForestClassifier(random_state=24, n_jobs=-1, oob_score=True)
rf_clf_1.fit(X_train, y_train)
param_grid = {
'n_estimators':[1,50,100,150,200],
'max_depth':[10,20,50,80,100]
}
rf_tuned_clf = GridSearchCV(rf_clf_1,param_grid=param_grid, scoring='roc_auc',cv=StratifiedKFold(n_splits=10),n_jobs=-1)
rf_tuned_clf.fit(X_train,y_train)
rf_tuned_clf.best_estimator_
print("Out-of-bag Evaluation Score for Random Forest Tuned : ", rf_tuned_clf.best_estimator_.oob_score_.round(3))
rf_tuned_clf.best_params_
rf_tuned_cv = pd.DataFrame(rf_tuned_clf.cv_results_)
rf_tuned_cv
rf_tuned_train_accuracy, rf_tuned_test_accuracy, rf_tuned_clf_conf_matrix, rf_tuned_clf_precision, rf_tuned_clf_recall, rf_tuned_clf_f1, rf_tuned_clf_roc_auc, rf_tuned_clf_gini_coeff, rf_tuned_clf_fpr, rf_tuned_clf_tpr, rf_tuned_clf_thresholds = classifier_metrics(rf_tuned_clf)
rf_tuned_feature_rank = plot_feature_rank(rf_tuned_clf.best_estimator_)
rf_tuned_feature_rank
sns.lineplot(y=rf_tuned_feature_rank['cumulative_sum'], x=np.arange(0,22,1))
So, RandomForest has provided excellent results, as expected. Let us now try a Boosting Ensemble - AdaBoost (Adaptive Boosting).
ada_clf = AdaBoostClassifier(algorithm='SAMME.R', random_state=24)
#Default Estimator is DecisionTreeClassifier() with max_depth = 1 (Tree Stump)
ada_clf.fit(X_train, y_train)
ada_train_accuracy, ada_test_accuracy, ada_clf_conf_matrix, ada_clf_precision, ada_clf_recall, ada_clf_f1, ada_clf_roc_auc, ada_clf_gini_coeff, ada_clf_fpr, ada_clf_tpr, ada_clf_thresholds = classifier_metrics(ada_clf)
SVM, Logistic Regression and Random Forest Classifiers are provided as the different Base Estimators for the AdaBoost Classifier. Let us see which classifier is picked by the GridSearch.
param_grid = {
'base_estimator' : [svm_clf_rbf, log_reg_clf, rf_clf],
'n_estimators': [1,50,100,150,200],
'learning_rate': [0.01, 0.05, 0.1, 1.0]
}
ada_tuned_clf = GridSearchCV(AdaBoostClassifier(random_state=24),param_grid=param_grid, scoring='roc_auc',cv=StratifiedKFold(n_splits=10),n_jobs=-1)
ada_tuned_clf.fit(X_train, y_train)
ada_tuned_clf.best_params_
ada_tuned_train_accuracy, ada_tuned_test_accuracy, ada_tuned_clf_conf_matrix, ada_tuned_clf_precision, ada_tuned_clf_recall, ada_tuned_clf_f1, ada_tuned_clf_roc_auc, ada_tuned_clf_gini_coeff, ada_tuned_clf_fpr, ada_tuned_clf_tpr, ada_tuned_clf_thresholds = classifier_metrics(ada_tuned_clf)
So, using a Bagging Ensemble 'RandomForest' as a Base Estimator for a Boosting Ensemble 'AdaBoost' has yielded the BEST results so far out of all the models.
Let us try with the Gradient Boosting methods next.
Instantiating the XGBoost (Extreme Gradient Boosting) Classifier. We have used 'gbtree', which uses Decision Trees as the base learners. The objective is 'binary:logistic', which optimises the cross-entropy log loss to produce classification probabilities.
xgb_clf = xgb.XGBClassifier(
booster='gbtree',
objective='binary:logistic',
random_state=24,
n_jobs=-1)
Now, let us fit the XGB model. Note that we have used early_stopping_rounds = 10: boosting stops adding new trees once the eval_metric on the eval_set shows no improvement for 10 consecutive rounds, rather than always running the full number of boosting rounds (this is different from growing a full tree and pruning it back).
xgb_clf.fit(
X_train,
y_train,
verbose=True,
early_stopping_rounds=10, #Stop adding boosting rounds when there is no improvement for 10 iterations
eval_metric='auc', #ROC AUC Curve
eval_set=[(X_test, y_test)]
)
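Since xgboost may not be installed everywhere, the same early-stopping idea can be sketched with scikit-learn's GradientBoostingClassifier, whose `n_iter_no_change` / `validation_fraction` parameters play the role of `early_stopping_rounds` / `eval_set` (synthetic data; an analogy, not this notebook's model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

# Early stopping halts the *addition of boosting rounds* once the held-out
# validation score stops improving for 10 consecutive rounds.
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     random_state=24)
gbc = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.2,
                                 n_iter_no_change=10, random_state=24)
gbc.fit(X_demo, y_demo)
print('Boosting rounds actually used:', gbc.n_estimators_)  # usually far fewer than 500
```
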
xgb_clf.best_score
xgb_train_accuracy, xgb_test_accuracy, xgb_clf_conf_matrix, xgb_clf_precision, xgb_clf_recall, xgb_clf_f1, xgb_clf_roc_auc, xgb_clf_gini_coeff, xgb_clf_fpr, xgb_clf_tpr, xgb_clf_thresholds = classifier_metrics(xgb_clf)
xgb_estimator = xgb.XGBClassifier(
objective='binary:logistic',
seed=24,
random_state=24
)
param_grid = {
'max_depth' : [3,4,5,6], #Tree Depth
'learning_rate' : [0.1,0.01,0.05], #Shrinkage Parameter to avoid being Overfit
'gamma' : [0,0.25,1.0], #Regularization Parameter
'reg_lambda' : [0,1.0,10.0], #Regularization Parameter
'scale_pos_weight' : [1,3] #For Class Imbalance handling
}
xgb_tuned_clf = GridSearchCV(xgb_estimator,param_grid=param_grid, scoring='roc_auc',cv=StratifiedKFold(n_splits=10),n_jobs=-1)
xgb_tuned_clf.fit(
X_train,
y_train,
verbose=True,
early_stopping_rounds=10, #Stop adding boosting rounds when there is no improvement for 10 iterations
eval_metric='auc', #ROC AUC Score
eval_set=[(X_test, y_test)]
)
xgb_tuned_clf.best_params_
xgb_tuned_clf.best_estimator_
xgb_tuned_clf.best_score_
xgb_tuned_train_accuracy, xgb_tuned_test_accuracy, xgb_tuned_clf_conf_matrix, xgb_tuned_clf_precision, xgb_tuned_clf_recall, xgb_tuned_clf_f1, xgb_tuned_clf_roc_auc, xgb_tuned_clf_gini_coeff, xgb_tuned_clf_fpr, xgb_tuned_clf_tpr, xgb_tuned_clf_thresholds = classifier_metrics(xgb_tuned_clf)
xgb_lone_estimator = xgb.XGBClassifier(
objective='binary:logistic',
seed=24,
subsample=0.9,
colsample_bytree=0.5,
gamma = 0.25,
learning_rate=0.05,
max_depth=4,
reg_lambda=10.0,
scale_pos_weight=3,
n_estimators=1 #Single Tree for Visualisation purpose
)
xgb_lone_estimator.fit(X_train, y_train)
xgb_bst = xgb_lone_estimator.get_booster()
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
print('%s: ' % importance_type, xgb_bst.get_score(importance_type=importance_type))
node_params = {
'shape' : 'box',
'style' : 'filled, rounded',
'fillcolor' : '#78cbe5'
}
leaf_params = {
'shape' : 'box',
'style' : 'filled',
'fillcolor' : '#e48038'
}
xgb.to_graphviz(xgb_lone_estimator,
num_trees=0,
size="5,5",
condition_node_params=node_params,
leaf_node_params=leaf_params
)
xgb_feature_rank = plot_feature_rank(xgb_tuned_clf.best_estimator_)
xgb_feature_rank
sns.lineplot(y=xgb_feature_rank['cumulative_sum'], x=np.arange(0,22,1))
lgbm_clf = LGBMClassifier(
boosting_type='gbdt', #Gradient Boosting Decision Trees
objective='binary',
n_estimators=100,
learning_rate=0.5,
is_unbalance=True,
n_jobs=-1,
random_state=24
)
lgbm_clf.fit(X_train, y_train)
lgbm_train_accuracy, lgbm_test_accuracy, lgbm_clf_conf_matrix, lgbm_clf_precision, lgbm_clf_recall, lgbm_clf_f1, lgbm_clf_roc_auc, lgbm_clf_gini_coeff, lgbm_clf_fpr, lgbm_clf_tpr, lgbm_clf_thresholds = classifier_metrics(lgbm_clf)
Now, let us try out (greedy) Stacking with Random Forest, XGBoost and LightGBM models!
greedy_stk_clf = StackingClassifier(estimators=[('RF', rf_tuned_clf), ('XGB', xgb_tuned_clf), ('LGBM', lgbm_clf)], final_estimator=xgb_tuned_clf)
greedy_stk_clf.fit(X_train, y_train)
greedy_stk_train_accuracy, greedy_stk_test_accuracy, greedy_stk_clf_conf_matrix, greedy_stk_clf_precision, greedy_stk_clf_recall, greedy_stk_clf_f1, greedy_stk_clf_roc_auc, greedy_stk_clf_gini_coeff, greedy_stk_clf_fpr, greedy_stk_clf_tpr, greedy_stk_clf_thresholds = classifier_metrics(greedy_stk_clf)
List of Model Evaluation Metrics:
Note:
We will look at all of these metrics due to the inherent class imbalance in the given dataset. However, since the model is intended as a preliminary screening step before a clinic appointment, we are most interested in reducing the 'False Negatives', and hence 'Recall' becomes one of the first metrics to evaluate in relation to the others.
metrics = {'Classifier':['Logistic Regression',
'Gaussian Naive Bayes',
'Support Vector Machine',
'Decision Trees',
'Random Forest',
'RF Hyperparameter Tuned',
'AdaBoost',
'AdaBoost Hyperparameter Tuned',
'Stacking Classifier',
'XGBoost',
'XGB Hyperparameter Tuned',
'LightGBM'],
'Accuracy' : [lr_test_accuracy,
gnb_test_accuracy,
svm_test_accuracy,
dt_test_accuracy,
rf_test_accuracy,
rf_tuned_test_accuracy,
ada_test_accuracy,
ada_tuned_test_accuracy,
stk_test_accuracy,
xgb_test_accuracy,
xgb_tuned_test_accuracy,
lgbm_test_accuracy],
'Precision':[lr_clf_precision,
gnb_clf_precision,
svm_clf_precision,
dt_clf_precision,
rf_clf_precision,
rf_tuned_clf_precision,
ada_clf_precision,
ada_tuned_clf_precision,
stk_clf_precision,
xgb_clf_precision,
xgb_tuned_clf_precision,
lgbm_clf_precision],
'Recall':[lr_clf_recall,
gnb_clf_recall,
svm_clf_recall,
dt_clf_recall,
rf_clf_recall,
rf_tuned_clf_recall,
ada_clf_recall,
ada_tuned_clf_recall,
stk_clf_recall,
xgb_clf_recall,
xgb_tuned_clf_recall,
lgbm_clf_recall],
'F1-Score':[lr_clf_f1,
gnb_clf_f1,
svm_clf_f1,
dt_clf_f1,
rf_clf_f1,
rf_tuned_clf_f1,
ada_clf_f1,
ada_tuned_clf_f1,
stk_clf_f1,
xgb_clf_f1,
xgb_tuned_clf_f1,
lgbm_clf_f1],
'ROC AUC':[lr_clf_roc_auc,
gnb_clf_roc_auc,
svm_clf_roc_auc,
dt_clf_roc_auc,
rf_clf_roc_auc,
rf_tuned_clf_roc_auc,
ada_clf_roc_auc,
ada_tuned_clf_roc_auc,
stk_clf_roc_auc,
xgb_clf_roc_auc,
xgb_tuned_clf_roc_auc,
lgbm_clf_roc_auc],
'Gini Coeff':[lr_clf_gini_coeff,
gnb_clf_gini_coeff,
svm_clf_gini_coeff,
dt_clf_gini_coeff,
rf_clf_gini_coeff,
rf_tuned_clf_gini_coeff,
ada_clf_gini_coeff,
ada_tuned_clf_gini_coeff,
stk_clf_gini_coeff,
xgb_clf_gini_coeff,
xgb_tuned_clf_gini_coeff,
lgbm_clf_gini_coeff]
}
model_eval_metrics = pd.DataFrame(metrics)
model_eval_metrics = model_eval_metrics.set_index('Classifier')
model_eval_metrics
plt.figure(figsize = (14 , 8))
plt.plot(lr_clf_fpr, lr_clf_tpr, label = 'Logistic Regression (area = {})'.format(lr_clf_roc_auc.round(2)))
plt.plot(gnb_clf_fpr, gnb_clf_tpr, label = 'Naive Bayes (area = {})'.format(gnb_clf_roc_auc.round(2)))
plt.plot(svm_clf_fpr, svm_clf_tpr, label = 'SVM (area = {})'.format(svm_clf_roc_auc.round(2)))
plt.plot(dt_clf_fpr, dt_clf_tpr, label = 'Decision Tree Classifier (area = {})'.format(dt_clf_roc_auc.round(2)))
plt.plot(rf_tuned_clf_fpr, rf_tuned_clf_tpr, label = 'Random Forest Tuned (area = {})'.format(rf_tuned_clf_roc_auc.round(2)))
plt.plot(ada_tuned_clf_fpr, ada_tuned_clf_tpr, label = 'AdaBoost Tuned (area = {})'.format(ada_tuned_clf_roc_auc.round(2)))
plt.plot(xgb_tuned_clf_fpr, xgb_tuned_clf_tpr, label = 'XGBoost Tuned (area = {})'.format(xgb_tuned_clf_roc_auc.round(2)))
plt.plot(lgbm_clf_fpr, lgbm_clf_tpr, label = 'LightGBM (area = {})'.format(lgbm_clf_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC AUC - Classifiers Comparison')
plt.legend(loc = 'lower right')
plt.show()
print("Best Performance Model :\n")
model_eval_metrics.idxmax()
plt.figure(figsize=(12,12))
sns.heatmap(model_eval_metrics, annot=True, linewidths = 0.3, fmt='.2f', cmap='YlGnBu').set_title('Classifier Dashboard')
plt.show()
The given dataset is trained and evaluated with the following Classifiers:
Standalone Classifiers:
Ensembles:
Meta-Classifier with Logistic Regression, Gaussian Naive Bayes and Support Vector Machines stacked up.
Random Forest Classifier (Bagging Ensemble)
Finally, all the classifier metrics are compared and evaluated using:
Inference from Classifier Comparison:
The objective of this entire exercise is to build a model that predicts Parkinson's, to be used as a preliminary alert before the patient books a doctor's appointment. False Positives therefore have limited impact (the appointment will resolve them in the next step), whereas False Negatives are costly, so we favour models with a higher True Positive Rate (Recall = TP/(TP+FN)). Hence, the 'AdaBoost with Random Forest as Base Estimator' and 'XGBoost' models are recommended here, with thresholds set using the ROC AUC and Precision-Recall curves.
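The threshold-setting step mentioned above can be sketched with `roc_curve`: pick the largest threshold whose TPR (Recall) clears a target, trading some false positives for fewer missed Parkinson's cases (the labels and probabilities below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical ground truth and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.55, 0.6, 0.7, 0.9, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Choose the highest threshold whose TPR (recall) is at least 0.9.
target_recall = 0.9
ok = tpr >= target_recall
chosen = thresholds[ok][0]   # thresholds are sorted in decreasing order
print('Chosen threshold:', chosen, '-> recall', tpr[ok][0])
```
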